This is one page of the R Handbook for Epidemiologists, but is being printed as a stand-alone page.
You can find the complete handbook on Github
When manipulating data, pivoting can be understood to refer to one of two processes: 1. the creation of pivot tables, which are tables “… of statistics that summarize the data of a more extensive table (such as from a database, spreadsheet, or business intelligence program). This summary might include sums, averages, or other statistics, which the pivot table groups together in a meaningful way… They arrange and rearrange (or”pivot“) statistics in order to draw attention to useful information. This leads to finding figures and facts quickly making them integral to data analysis.” https://en.wikipedia.org/wiki/Pivot_table#.
The former is a crucial step in data analysis, and is covered elsewhere (link to dplyr, summarise, etc). In this section, we will focus on the latter definition.
https://datacarpentry.org/r-socialsci/03-dplyr-tidyr/index.html
Transforming a dataset from wide to long
Data are often entered and stored in a format that might be useful for presentation, but not for analysis. Let us take the count_data dataset as an example, which is stored in a “wide” format, which means that each column is a variable and each row an observation. This is useful for presenting the information in a table or for entering data (e.g. in Excel) from case report forms. However, these typically needs to be transformed to “long” format in order to analyse and visualise.
DT::datatable(count_data, rownames = FALSE, filter="top", options = list(pageLength = 5, scrollX=T) )Each observation in this dataset refers to the malaria counts at one of 65 facilities on a given date, ranging from 2019-03-18 to 2019-06-14. These facilties are located in one Province (North) and four Districts (Spring, Bolo, Dingo, and Barnard). The dataset provides the overall counts of malaria, as well as age-specific counts in each of three age groups - <4 years, 5-14 years, and 15 years and older.
Visualising the overall malaria counts over time poses no difficulty with the data in it’s current format:
ggplot(count_data) +
geom_col(aes(x = data_date, y = malaria_tot))However, what if we wanted to display the relative contributions of each age group to this total count? In this case, we need to ensure that the variable of interest (age group), appears in the dataset in a single column that can be passed to {ggplot2}’s “aesthetics” (aes()) function.
Consider also using the common problem whereby data are stored with dates as the columns, as in tidyr::table4a
tidyr::table4a
## # A tibble: 3 x 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766pivot_longer()First, let’s begin by loading our packages and converting count_data to a tibble for easy printing:
pacman::p_load(tidyverse)
# Convert count_data to `tibble` for better printing
count_data <-
count_data %>%
as_tibble()
count_data
## # A tibble: 3,038 x 10
## location_name data_date submitted_date Province District `malaria_rdt_0-…
## <chr> <date> <date> <chr> <chr> <int>
## 1 Facility 1 2019-06-13 2019-06-14 North Spring 11
## 2 Facility 2 2019-06-13 2019-06-14 North Bolo 11
## 3 Facility 3 2019-06-13 2019-06-14 North Dingo 8
## 4 Facility 4 2019-06-13 2019-06-14 North Bolo 16
## 5 Facility 5 2019-06-13 2019-06-14 North Bolo 9
## 6 Facility 6 2019-06-13 2019-06-14 North Dingo 3
## 7 Facility 6 2019-06-12 2019-06-14 North Dingo 4
## 8 Facility 5 2019-06-12 2019-06-14 North Bolo 15
## 9 Facility 5 2019-06-11 2019-06-14 North Bolo 11
## 10 Facility 5 2019-06-10 2019-06-14 North Bolo 19
## # … with 3,028 more rows, and 4 more variables: `malaria_rdt_5-14` <int>,
## # malaria_rdt_15 <int>, malaria_tot <int>, newid <int>Next, we want to use {tidyr}’s pivot_longer() function to convert the wide dataset to a long format, converting the four columns with data on malaria counts to two new columns: one which captures the variable name and one which captures the values from the cells. Since these four variables all begin with the prefix malaria_, we can make use of the handy function starts_with().
df_long <-
count_data %>%
pivot_longer(
cols = starts_with("malaria_")
)
df_long
## # A tibble: 12,152 x 8
## location_name data_date submitted_date Province District newid name value
## <chr> <date> <date> <chr> <chr> <int> <chr> <int>
## 1 Facility 1 2019-06-13 2019-06-14 North Spring 1 malari… 11
## 2 Facility 1 2019-06-13 2019-06-14 North Spring 1 malari… 12
## 3 Facility 1 2019-06-13 2019-06-14 North Spring 1 malari… 23
## 4 Facility 1 2019-06-13 2019-06-14 North Spring 1 malari… 46
## 5 Facility 2 2019-06-13 2019-06-14 North Bolo 2 malari… 11
## 6 Facility 2 2019-06-13 2019-06-14 North Bolo 2 malari… 10
## 7 Facility 2 2019-06-13 2019-06-14 North Bolo 2 malari… 5
## 8 Facility 2 2019-06-13 2019-06-14 North Bolo 2 malari… 26
## 9 Facility 3 2019-06-13 2019-06-14 North Dingo 3 malari… 8
## 10 Facility 3 2019-06-13 2019-06-14 North Dingo 3 malari… 5
## # … with 12,142 more rowsHowever, we could also have specified the columns by position:
count_data %>%
pivot_longer(
cols = 6:9
)
## # A tibble: 12,152 x 8
## location_name data_date submitted_date Province District newid name value
## <chr> <date> <date> <chr> <chr> <int> <chr> <int>
## 1 Facility 1 2019-06-13 2019-06-14 North Spring 1 malari… 11
## 2 Facility 1 2019-06-13 2019-06-14 North Spring 1 malari… 12
## 3 Facility 1 2019-06-13 2019-06-14 North Spring 1 malari… 23
## 4 Facility 1 2019-06-13 2019-06-14 North Spring 1 malari… 46
## 5 Facility 2 2019-06-13 2019-06-14 North Bolo 2 malari… 11
## 6 Facility 2 2019-06-13 2019-06-14 North Bolo 2 malari… 10
## 7 Facility 2 2019-06-13 2019-06-14 North Bolo 2 malari… 5
## 8 Facility 2 2019-06-13 2019-06-14 North Bolo 2 malari… 26
## 9 Facility 3 2019-06-13 2019-06-14 North Dingo 3 malari… 8
## 10 Facility 3 2019-06-13 2019-06-14 North Dingo 3 malari… 5
## # … with 12,142 more rowsor by named range:
count_data %>%
pivot_longer(
cols = `malaria_rdt_0-4`:malaria_tot
)
## # A tibble: 12,152 x 8
## location_name data_date submitted_date Province District newid name value
## <chr> <date> <date> <chr> <chr> <int> <chr> <int>
## 1 Facility 1 2019-06-13 2019-06-14 North Spring 1 malari… 11
## 2 Facility 1 2019-06-13 2019-06-14 North Spring 1 malari… 12
## 3 Facility 1 2019-06-13 2019-06-14 North Spring 1 malari… 23
## 4 Facility 1 2019-06-13 2019-06-14 North Spring 1 malari… 46
## 5 Facility 2 2019-06-13 2019-06-14 North Bolo 2 malari… 11
## 6 Facility 2 2019-06-13 2019-06-14 North Bolo 2 malari… 10
## 7 Facility 2 2019-06-13 2019-06-14 North Bolo 2 malari… 5
## 8 Facility 2 2019-06-13 2019-06-14 North Bolo 2 malari… 26
## 9 Facility 3 2019-06-13 2019-06-14 North Dingo 3 malari… 8
## 10 Facility 3 2019-06-13 2019-06-14 North Dingo 3 malari… 5
## # … with 12,142 more rowsNotice that the newly created dataframe (df_long) has more rows (12,152 vs 3,038); it has become longer. In fact, it is precisely four times as long, because each row in the original dataset now represents four rows in df_long, one for each of the malaria count observations (<4y, 5-14y, 15y+, and total).
In addition to becoming longer, the new dataset has fewer columns (8 vs 10), as the data previously stored in four columns (those beginning with the prefix malaria_) is now stored in two. These two columns are given the default names of name and value, but we can override these defaults to provide more meaningful names, which can help remember what is stored within, using the names_to and values_to arguments. Let’s use the names age_group and count:
df_long <-
count_data %>%
pivot_longer(
cols = starts_with("malaria_"),
names_to = "age_group",
values_to = "counts"
)
df_long
## # A tibble: 12,152 x 8
## location_name data_date submitted_date Province District newid age_group
## <chr> <date> <date> <chr> <chr> <int> <chr>
## 1 Facility 1 2019-06-13 2019-06-14 North Spring 1 malaria_…
## 2 Facility 1 2019-06-13 2019-06-14 North Spring 1 malaria_…
## 3 Facility 1 2019-06-13 2019-06-14 North Spring 1 malaria_…
## 4 Facility 1 2019-06-13 2019-06-14 North Spring 1 malaria_…
## 5 Facility 2 2019-06-13 2019-06-14 North Bolo 2 malaria_…
## 6 Facility 2 2019-06-13 2019-06-14 North Bolo 2 malaria_…
## 7 Facility 2 2019-06-13 2019-06-14 North Bolo 2 malaria_…
## 8 Facility 2 2019-06-13 2019-06-14 North Bolo 2 malaria_…
## 9 Facility 3 2019-06-13 2019-06-14 North Dingo 3 malaria_…
## 10 Facility 3 2019-06-13 2019-06-14 North Dingo 3 malaria_…
## # … with 12,142 more rows, and 1 more variable: counts <int>We can now pass this new dataset to {ggplot2} to display the malaria counts by age group:
ggplot(df_long) +
geom_col(
aes(x = data_date, y = counts, fill = age_group)
)Have a look at the plot - what is wrong here? We have encountered a common problem - we have also included the total counts from the malaria_tot column, so the magnitude of each bar in the plot is twice as high as it should be.
We can handle this in a number of ways. We could simply filter it from the dataset we pass to {ggplot2}:
df_long %>%
filter(age_group != "malaria_tot") %>%
ggplot() +
geom_col(
aes(x = data_date, y = counts, fill = age_group)
)Alternatively, we could have excluded this variable when we ran pivot_longer, thereby maintaining it in the dataset as a separate variable:
count_data %>%
pivot_longer(
cols = `malaria_rdt_0-4`:malaria_rdt_15,
names_to = "age_group",
values_to = "counts"
) %>%
ggplot() +
geom_col(
aes(x = data_date, y = counts, fill = age_group)
)Transforming a dataset from long to wide
In some instances, we may wish to convert a dataset to a wider format. For this, we can use the pivot_wider() function.
A typical use case is when we want to transform the results of an analysis into a format which is more digestible for the reader. Typically, we are transforming a dataset in which the observations are spread over multiple rows to one in which each observation occupies a single row.
This introduces the useful topic of “tidy data”, in which each variable has it’s own column, each observation has it’s own row, and each value has it’s own cell. More about this topic can be found here https://r4ds.had.co.nz/tidy-data.html.
Let us use the linelist dataset. Suppose that we want to know the counts of individuals in the different age groups, by sex:
linelist <-
linelist %>%
as_tibble()
df_wide <-
linelist %>%
count(age_cat, gender)This gives us a long dataset that is great for visualisation, but not ideal for presentation in a table:
ggplot(df_wide) +
geom_col(aes(x = age_cat, y = n, fill = gender))Therefore, we can use pivot_wider() to put this into a better format for inclusion as tables in our reports. The argument names_from specifies the column from which to generate the new column names, while the argument values_from specifies the column from which to take the values to populate the cells:
table_wide <-
df_wide %>%
pivot_wider(
names_from = gender,
values_from = n
)
table_wide
## # A tibble: 8 x 4
## age_cat f m `NA`
## <fct> <int> <int> <int>
## 1 0-4 612 410 32
## 2 5-9 648 466 39
## 3 10-14 490 423 44
## 4 15-19 422 435 32
## 5 20-29 425 606 34
## 6 30-49 168 478 12
## 7 50-69 1 23 NA
## 8 <NA> NA NA 88This table is much nicer for inclusion in our reports:
table_wide %>%
janitor::adorn_totals(c("row", "col")) %>% # adds a total row and column
knitr::kable() %>%
kableExtra::row_spec(row = 9, bold = TRUE) %>%
kableExtra::column_spec(column = 5, bold = TRUE) | age_cat | f | m | NA | Total |
|---|---|---|---|---|
| 0-4 | 612 | 410 | 32 | 1054 |
| 5-9 | 648 | 466 | 39 | 1153 |
| 10-14 | 490 | 423 | 44 | 957 |
| 15-19 | 422 | 435 | 32 | 889 |
| 20-29 | 425 | 606 | 34 | 1065 |
| 30-49 | 168 | 478 | 12 | 658 |
| 50-69 | 1 | 23 | NA | 24 |
| NA | NA | NA | 88 | 88 |
| Total | 2766 | 2841 | 281 | 5888 |
Filling in missing data
In some situations after a pivot, and more commonly after a bind, we are left with gaps in some cells that we would like to fill. For example, take two datasets, each with observations for the measurement number, the name of the facility, and the case count at that time. However, the second dataset also has a variable Year. When we perform a bind_rows() to join the two datasets together, the Year variable is filled with NA for those rows where there was no prior information (i.e. the first dataset):
df1 <-
tibble::tribble(
~Measurement, ~Facility, ~Cases,
1, "Hosp 1", 66,
2, "Hosp 1", 26,
3, "Hosp 1", 8,
1, "Hosp 2", 71,
2, "Hosp 2", 62,
3, "Hosp 2", 70,
1, "Hosp 3", 47,
2, "Hosp 3", 70,
3, "Hosp 3", 38,
)
df1
## # A tibble: 9 x 3
## Measurement Facility Cases
## <dbl> <chr> <dbl>
## 1 1 Hosp 1 66
## 2 2 Hosp 1 26
## 3 3 Hosp 1 8
## 4 1 Hosp 2 71
## 5 2 Hosp 2 62
## 6 3 Hosp 2 70
## 7 1 Hosp 3 47
## 8 2 Hosp 3 70
## 9 3 Hosp 3 38
df2 <-
tibble::tribble(
~Year, ~Measurement, ~Facility, ~Cases,
2000, 1, "Hosp 4", 82,
2001, 2, "Hosp 4", 87,
2002, 3, "Hosp 4", 46
)
df2
## # A tibble: 3 x 4
## Year Measurement Facility Cases
## <dbl> <dbl> <chr> <dbl>
## 1 2000 1 Hosp 4 82
## 2 2001 2 Hosp 4 87
## 3 2002 3 Hosp 4 46
df_combined <-
bind_rows(df1, df2) %>%
arrange(Measurement, Facility)
df_combined
## # A tibble: 12 x 4
## Measurement Facility Cases Year
## <dbl> <chr> <dbl> <dbl>
## 1 1 Hosp 1 66 NA
## 2 1 Hosp 2 71 NA
## 3 1 Hosp 3 47 NA
## 4 1 Hosp 4 82 2000
## 5 2 Hosp 1 26 NA
## 6 2 Hosp 2 62 NA
## 7 2 Hosp 3 70 NA
## 8 2 Hosp 4 87 2001
## 9 3 Hosp 1 8 NA
## 10 3 Hosp 2 70 NA
## 11 3 Hosp 3 38 NA
## 12 3 Hosp 4 46 2002fill()In this case, Year is a useful variable to include, particularly if we want to explore trends over time. Therefore, we use fill() to fill in those empty cells, by specifying the column to fill and the direction (in this case up):
df_combined %>%
fill(Year, .direction = "up")
## # A tibble: 12 x 4
## Measurement Facility Cases Year
## <dbl> <chr> <dbl> <dbl>
## 1 1 Hosp 1 66 2000
## 2 1 Hosp 2 71 2000
## 3 1 Hosp 3 47 2000
## 4 1 Hosp 4 82 2000
## 5 2 Hosp 1 26 2001
## 6 2 Hosp 2 62 2001
## 7 2 Hosp 3 70 2001
## 8 2 Hosp 4 87 2001
## 9 3 Hosp 1 8 2002
## 10 3 Hosp 2 70 2002
## 11 3 Hosp 3 38 2002
## 12 3 Hosp 4 46 2002We can rearrange the data so that we would need to fill in a downward direction:
df_combined <-
df_combined %>%
arrange(Measurement, desc(Facility))
df_combined
## # A tibble: 12 x 4
## Measurement Facility Cases Year
## <dbl> <chr> <dbl> <dbl>
## 1 1 Hosp 4 82 2000
## 2 1 Hosp 3 47 NA
## 3 1 Hosp 2 71 NA
## 4 1 Hosp 1 66 NA
## 5 2 Hosp 4 87 2001
## 6 2 Hosp 3 70 NA
## 7 2 Hosp 2 62 NA
## 8 2 Hosp 1 26 NA
## 9 3 Hosp 4 46 2002
## 10 3 Hosp 3 38 NA
## 11 3 Hosp 2 70 NA
## 12 3 Hosp 1 8 NA
df_combined <-
df_combined %>%
fill(Year, .direction = "down")
df_combined
## # A tibble: 12 x 4
## Measurement Facility Cases Year
## <dbl> <chr> <dbl> <dbl>
## 1 1 Hosp 4 82 2000
## 2 1 Hosp 3 47 2000
## 3 1 Hosp 2 71 2000
## 4 1 Hosp 1 66 2000
## 5 2 Hosp 4 87 2001
## 6 2 Hosp 3 70 2001
## 7 2 Hosp 2 62 2001
## 8 2 Hosp 1 26 2001
## 9 3 Hosp 4 46 2002
## 10 3 Hosp 3 38 2002
## 11 3 Hosp 2 70 2002
## 12 3 Hosp 1 8 2002This dataset is now useful for plotting:
ggplot(df_combined) +
aes(Year, Cases, fill = Facility) +
geom_col()But less useful for presenting in a table, so let’s practice converting this long, untidy dataframe into a wider, tidy dataframe:
df_combined %>%
pivot_wider(
id_cols = c(Facility, Year, Cases),
names_from = "Year",
values_from = "Cases"
) %>%
arrange(Facility) %>%
janitor::adorn_totals(c("row", "col")) %>%
knitr::kable() %>%
kableExtra::row_spec(row = 5, bold = TRUE) %>%
kableExtra::column_spec(column = 5, bold = TRUE) | Facility | 2000 | 2001 | 2002 | Total |
|---|---|---|---|---|
| Hosp 1 | 66 | 26 | 8 | 100 |
| Hosp 2 | 71 | 62 | 70 | 203 |
| Hosp 3 | 47 | 70 | 38 | 155 |
| Hosp 4 | 82 | 87 | 46 | 215 |
| Total | 266 | 245 | 162 | 673 |
N.B. In this case, we had to specify to only include the three variables Facility, Year, and Cases as the additional variable Measurement would interfere with the creation of the table:
df_combined %>%
pivot_wider(
names_from = "Year",
values_from = "Cases"
) %>%
knitr::kable()| Measurement | Facility | 2000 | 2001 | 2002 |
|---|---|---|---|---|
| 1 | Hosp 4 | 82 | NA | NA |
| 1 | Hosp 3 | 47 | NA | NA |
| 1 | Hosp 2 | 71 | NA | NA |
| 1 | Hosp 1 | 66 | NA | NA |
| 2 | Hosp 4 | NA | 87 | NA |
| 2 | Hosp 3 | NA | 70 | NA |
| 2 | Hosp 2 | NA | 62 | NA |
| 2 | Hosp 1 | NA | 26 | NA |
| 3 | Hosp 4 | NA | NA | 46 |
| 3 | Hosp 3 | NA | NA | 38 |
| 3 | Hosp 2 | NA | NA | 70 |
| 3 | Hosp 1 | NA | NA | 8 |